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A.  INTRODUCTION 


The  vector  processor  has  reopened  for  research  the  study 
of  general  sparse  matrix  algorithms.  Such  study  can  be  divided  into 
the  following  classifications. 

1.  Block-oriented  methods  that  exploit  vectors  within  dense 
substructures . 

2.  Simultaneous  system  methods  that  exploit  global  structural 
symmetries . 

3.  Fast  scalar  methods  (a  new  study) ,  for  parts  of  a  system 
for  which  the  above  methods  cannot  be  used. 


General  sparse  solvers  tend  to  be  used  in  the  following  types 
of  applications. 

1.  ODE/algebraic  systems  such  as  electrical  circuits  and  rigid 
body  dyneimics,  where  the  random  nature  of  the  structural 
regularities  requires  general  sparse  solvers.  The  diffi¬ 
culty  here  is  that  vectorization  may  require  major  modifica¬ 
tions  in  the  equation  formulation  by  the  user. 

2 .  Finite  element  and  finite  difference  systems ;  here  they  must  ooipete 
with  solvers  written  for  specific  structures  such  as  block 
triadiagonal  and  banded  systems.  With  the  ability  to  solve 
arbitrary  structures,  they  are  excellent  for  vector  processor 
and  algorithm  evaluation  before  structurally  dedicated  assem¬ 
bly  coded  solvers  can  be  written.  It  must  be  acknowledged 
however,  that  vector  processors  favor  dense  systems  and  dedi- 
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With  these  realizations,  current  research  is  oriented  toward 
solutions  of  ODE/algebraic  applications.  However,  opportunities  for 
use  of  general  solvers  in  solution  of  finite  difference  and  finite 
element  grids  will  certainly  be  exploited. [2] 


B.  PROGRESS 

1.  Locally  dense  systems.  A  block-oriented  general  sparse 
equation  solver  that  exploits  cache  memory  on  the  CRAY-1  has  been 
prepared  and  documented  13];  comparisons  are  now  being  made  between 
its  performance  and  solvers  for  specialized  common  structures.  One 
performance  generalization  that  can  be  made  is  that  the  general  solver 
devotes  approximately  half  of  its  time  to  list  processing  for  square 
blocks  of  size  3;  for  larger  blocks,  the  numeric  processing  kernels 
predominate. 

2.  Simultaneous  systems.  The  concept  and  application  of  a 
simultaneous  general  sparse  solver  was  first  ennunciated  in  [1]  for 
VLSI  circuit  analysis  and  in  [2]  for  solution  of  2-D  finite  difference 
grids.  The  performance  figures  given  were  based  on  an  assembly  lan¬ 
guage  (CAL)  implementation.  A  new  approach,  patterned  after  the  work 
of  Gustavson  14],  is  now  being  studied,  where  explicit,  loopless 
vector  code  specific  to  a  sparsity  pattern  is  first  generated  to  solve 
across  identically-structured  sparse  systems.  The  major  objection  to 
Gustavson 's  original  scalar  code  generation  scheme  was  the  code  storage 
required  to  solve  large  systems.  However,  because  each  vector  instruc¬ 
tion  now  performs  up  to  64  floating  point  operations  in  a  simultaneous 
sparse  solver,  the  code  storage  requirements  are  potentially  (1/64) th 
that  of  the  equivalent  scalar  code.  A  code  generation  version  of  our 
original  CAL  simultaneous  sparse  solver  is  now  under  development. 


3.  Fast  scalar  solvers.  Sparse  systems  inevitably  have  (hope¬ 


fully  small)  components  not  amenable  to  vectorized  solution.  In 
VLSI  circuit  analysis,  for  example,  this  component  corresponds  to  the 
interconnection  network.  Experience  indicates  that  a  small  scalar 
computational  component  can  become  important  if  care  is  not  taken 
in  its  coding. 

Fast  scalar  equation  solvers  have  been  developed  to  execute  on 
the  CRAY-1  in  the  range  of  12-20  MFLOPS.  For  example,  a  seminar 
student  at  Michigan  developed,  by  careful  instruction  scheduling  with 
our  CRAY-1  simulator,  a  scalar  tridiagonal  solver  that  outperforms 
vectorized  cyclic  reduction  codes  (5] .  To  replace  the  inter¬ 
active  instruction  scheduling  performed  by  the  student,  we  are 
examining  the  use  of  the  CPM  for  automatically  scheduling  explicit 
scalar  code  generated  in  the  manner  of  Gustavson  (4] .  CPM  is  used 
for  instruction  scheduling  in  the  Fortran  compilers  of  both  the  CRAY-1 
and  the  CYBER  203/205,  so  that  its  specialization  to  sparse  matrix 
codes  should  not  be  limited  to  the  CRAY-1  in  application.  The  instruc- 
tionscheduler  should  also  provide  an  interesting  adjunct  to  our  CRAY-1 
cross  assembler/simulator  software  (see  section  5  below) . 

4.  VLSI  design.  A  PhD  student  from  UC  Berkeley  engaged  in 
electronic  circuit  simulation  was  a  visitor  to  our  research  group 
during  the  summer  of  1980  to  study  the  CRAY-1  and  our  sparse 
matrix  research.  Under  subsequent  Bell  Laboratories  auspices,  he 
produced  the  paper  of  Appendix  A.  The  sparse  solver  he  developed 
generates  explicit  unoptimized  scalar  code;  despite  this  ineffic¬ 
iency,  the  sparse  matrix  solution  dropped  in  significance  (visa 
yis  the  original,  Fortran-coded  solver)  to  21%  of  the  total  analysis 
time.  We  have  consequently  redirected  1-2  months  of  effort  to 


assisting  him  in  the  vectorization  of  the  now  dominant  equation 
formulation  (modeling)  stage  of  the  circuit  analysis.  His  sparse 
matrix  code  has  spawned  the  abov'e-mentioned  research  on  scheduling, 
however. 

5.  Simulator  development .  There  is  increased  interest  in  our 
assembler/simulator  package*  among  CRAY-l  users  who  are  dedicated  to 
developing  higher  performance  codes.  Because  we  benefit  from  inter¬ 
action  with  such  sophisticated  groups,  we  are  devoting  1-2  man-months 
to  remove  local  operating  system  dependencies  of  the  software  that 
inhibit  conversion  to  other  systems.  Also,  we  are  promised  early  access 
to  serial  #1  of  the  CRAY-2,  which  is  expected  to  have  an  instruction 
set  similar  to  the  CRAY-1.  The  simulator  will  then  be  modified  accord¬ 
ingly  and  will  become  even  more  useful  to  us  and  to  others  as  an  evalu¬ 
ation  and  code  development  tool, 

C.  COUPLING  ACTIVITIES 
A.  Seminars 

1.  On  vectorized  sparse  elimination. 

a.  Washington  State  University,  November  17,  1980 

b.  University  of  California,  Berkeley,  November  19,  1980 

c.  University  of  Texas,  Austin,  October  30,  1980 

2.  On  other  topics  in  vector  processing. 

a.  4-day  seminar  at  AFFDL,  June  1980 

b.  Seminar  at  LANL,  August  1980 


*Mitsubishi  (Japan),  Daresbury  (UK),  Exxon,  Univ.  of  Alabama 


B.  Counsulting 

1.  Visiting  scientist,  AFPDL,  to  give  instruction  on  algorithms 
for  vector  processing,  and  to  study  I/O  problems  associated 
with  Navier-Stokes  codes  on  the  CYBER  203/205. 

2.  Visiting  scientist,  LANL,  on  vectorized  Monte  Carlo. 

3.  Industrial  consultant,  Mobil  Research  and  Development,  on  the 
vectorization  of  3-D  diffussion  codes  associated  with 

oil  reservoir  drilling  and  management. 

C .  Other 

A  one-week  short  course  at  the  University  of  Michigan  on  High 
Speed  Computation  was  taught  in  August,  1980.  Among  44  attendees 
were  representatives  from  NRL  and  DOD. 
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ABSTRACT: 

The  alitoriUuns,  data  atruclurca  and  semleccductor 
dcvice  modeling  technique!  proposed  for  a  nev  simula¬ 
tion  program  for  LSI  and  VLSI  eircuiti  arc  presented. 
The  selected  algorithms  and  modeling  techniques  are 
designed  tor  efficient  ezecuUon  on  vector  computers. 

1.  Introduction 

The  SPICE  [l]  program  has  had  wide  acceptance 
and  experience  for  integrated  circuit  evaluation  during 
the  last  ten  years.  Although  initially  designed  to  simu¬ 
late  efficiently  circuits  of  up  to  one  to  two  hundred  ele¬ 
ments.  SPICE  2G  [2]  is  presently  used  to  analyze  cir¬ 
cuits  which  are  one  to  two  orders  of  magnitude  larger. 
The  run-lime  [or  a  circuit  of  tliis  extent  is  measured  in 
days  on  the  VAX  11/780.  hours  on  a  CYDER  and  lens  of 
minutes  on  the  CRAY-1.  In  .spile  of  an  inherent  increase 
in  the  execution  speed  of  SP1CE2  on  now  main-frame 
computers,  an  additional  order  of  magnitude  increase 
in  speed  is  needed  for  the  efficient  simulation  of  LSI  cir¬ 
cuits. 

Several  papers  have  recently  been  published  on 
algorithms  and  approaches  for  the  computer-aided  cir¬ 
cuit  analysis  of  LSI.  However  only  two  working  circuit 
simulation  programs  [3.4]  have  been  reported  to  dale 
and  neither  provides  the  desired  speed  imp.-ovement 
for  LSI  circuits. 

The  different  design  aspects  of  the  new  program 
are  outlined  in  the  following  sections.  The  algoriltims 
and  appropriate  data  structures  are  described  first. 
The  modeling  techniques  used  in  timing  simulation  c.m 
be  adapted  efficiently  to  the  new  program  as  stiown 
next.  Comments  on  the  initial  results  obtained  from 
experiments  with  SPICE  2C  on  the  CRAY-1  arc  presented 
in  conclusion. 

2.  Algorithms 

Two  basic  factors  of  present  technology  are  con¬ 
sidered  in  the  design  of  the  new  LSI  circuit  simulator.' 
The  first  one  is  that  LSI  circuits  are  usually  a  collection 
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of  a  limited  number  of  structurally  identical  functional 
blocks  such  as  logic  gales,  operational  amphners.  etc. 
The  second  factor  is  the  advent  of  vector  computers 
which  provide  the  ideal  architecture  for  fast  compula¬ 
tions  on  repetitive  structures.  SP1CE2  operates  on  an 
entire  circuit  matrix  which  is  loaded  element  by  ele¬ 
ment.  The  analysis  in  the  new  program  is  done  at  the 
functional  block  (subcircuit)  level.  However,  only  direct 
methods  arc  used  m  the  solution  procedure,  i.e.  no 
decoupling  lakes  place  at  any  level  as  in  liming  or 
mixed-mode  simulation  [b]  However  the  structured 
inout  description  is  used  to  perform  an  important  part 
..  compulation  in  parallel.  Therefore,  the  new  pro¬ 
gram  g.  lups  the  identical  functional  blocks  together 
and  performs  a  two-lcvel  analysis  [b]  ich  funclioi.u. 
block  (s-’bcircuil)  generates  a  diagonal  submalrix  in  the 
overall  circuit  matrix.  Thus  the  circuit  can  be 
represented  as  a  Bordered  Block  Diagonal  Form  (DBD:') 
matrix  At  the  lower  level  there  ore  the  diagonal  sub- 
matrices  representing  the  internal  nodes  of  each  sub- 
circuil  and  at  the  upper  level  there  is  the  in' erconnec- 
lion  matrix  (borders  and  lower-right  corner).  The 
linearization  and  solution  of  all  subcircuits  of  the  same 
type  can  be  done  simultaneously  in  the  vector-mode. 
This  is  possible  because  all  these  subcircuits  have  the 
same  topology.  The  feature  of  parameter  passing 
between  subcireuit  call  and  subcircuil  defu-.it.on 
increases  the  occurancc  of  topologically  identical  sub- 
circuits  This  feature  provides  the  possibility  of  drft,  mg 
a  diffcriTil  element  value,  device  or  model  parameter  at 
each  instance  of  the  same  subcircuil. 

The  linear  system  of  equations  is  solved  by  rxecul- 
ing  generated  vector  code.  This  approach  is  used 
because  an  important  part  of  the  total  execution  lime  is 
spent  in  equation  solution.  One  set  of  vector  code  is 
generated  symbolically  for  each  type  of  subcircuil.  The 
same  code  is  then  used  for  the  l.U  factorization,  forward 
and  backward  substitution  of  all  subcircuils  of  the  same 
type.  This  approach  also  reduces  the  severity  of  the 
gather/scaller  problem  due  to  provisions  taken  in  the 
design  of  the  data  structure,  as  described  below. 


3.  Data  Structure 

The  dynamic  memory  management  icheme  o( 
SP1CE2  i>  preserved  (or  handling  tlic  circuit  data.  How¬ 
ever,  the  data  structure  which  is  predominantly  of  the 
Unked-list  type  in  SPICES  is  changed  The  data  structure 
and  the  main  analysis  loop  are  designed  to  accommo¬ 
date  the  new  algorithms  and  to  make  the  vectonzation 
possible.  Thus,  each  element  type  has  its  own  parame¬ 
ter  table:  and  data  may  be  aligned  in  such  a  way  that 
vector  mode  operations  can  be  programmed  most 
edectively.  The  information  on  the  circuit  sparse 
matrix  is  organized  as  one  set  of  sparse  submatriz 
pointers  for  each  type  of  subcircuit  and  another  set  of 
pointers  for  the  interconnection  matrix.  The  nonzero 
subniatrix  entries  and  RHS  (or  each  subcircuit  of  the 
same  type  are  aligned  and  appear  as  one  dataset. 

4.  Device  Models 

Model  evaluation  is  subject  to  two  constraints. 
First,  it  must  provide  a  most  accurate  representation  of 
devices  with  continuously  changing  features  due  to  fast 
progress  in  processing.  Second,  it  must  be  coded  in 
such  a  way  as  to  allow  vectorizalion  by  a  Fortran  com¬ 
piler.  Both  requirements  can  be  fulAlled  if  look-up 
tables  (or  device  data  are  used  instead  of  mathematical 
models  (7).  The  accuracy,  speed  and  ease  of  implemen¬ 
tation  are  some  of  the  considerations  underlying  the 
selection  of  this  modeling  appro.ich  for  circuit  simula¬ 
tion  (5].  The  analytical  models  available  in  SPICE  2C  can 
be  used  os  well  if  recoded  (or  vectorization  purposes 
Some  of  the  simpler  models  will  be  rewritten  in  order  to 
compare  their  performance  with  the  table  models. 
Another  important  aspect  of  the  linearization  of  sem¬ 
iconductor  devices  is  internal  node  suppression  [Bj 
Internal  nodes  are  generated  due  to  the  existence  of 
series  parasitic  resistances  at  the  device  terminals. 
Thus  for  a  circuit  with  000  nodes  and  1000  bipolar 
transistors  if  all  the  parasitic  resistances  for  the 
emitter,  base  and  collector  terminals  are  spcciflcd.  the 
circuit  matrix  will  have  an  artificial  dimension  of  3B00 
rather  than  the  BOO  if  node  suppression  is  used. 

For  portability  reasons  the  major  part  of  the  new 
program  is  coded  in  Fortran  with  restrictions  required 
by  vector  compilers  in  order  to  generate  vectorized 
code.  The  linear  equation  solution  part  is  coded  in 
assembler  language  to  obtain  important  speediips  of 
computation. 

5.  Performance  Estimate 

An  estimate  of  the  performance  of  the  new  pro¬ 
gram  can  be  made  from  various  experiments  with  SPICE 
2G  on  the  CRAY-1.  A  bipolar  NAND  gate  “t-bit  adder  as 
shown  in  Fig.  1  is  used  as  a  test  circuit  (this  example 
can  be  easily  expanded  to  four  or  eight  times  its 


complexity  becoming  a  IB-  or  32-bit  adder,  reipec- 
tively).  Ttie  circuit  contains  2BS  semiconductor  devices 
(IBO  BJTs  and  lOS  diodes)  and  is  described  by  a  4S0*4b0 
matrix.  If  node  suppression  is  implemented  the  site  of 
the  matrix  will  be  only  270*270  (an  additional  node  is 
generated  by  the  base  resistance  of  each  of  the  1B3 
Bi’Ts). 
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An  interesting  observation  is  the  time  breakdown 
between  model  evaluation  and  equation  solution  on  the 
CRAY-1.  On  scalar  computers  (or  all-Fortran  versions 
the  breakdown  between  the  two  is  roughly  705/30?  for 
a  50  equation  circuit.  The  CKAY-1,  a  pipelined  machine, 
is  more  cAicicnt  in  evaluating  models  rather  than  trac¬ 
ing  through  linked  lists.  Thus,  with  an  all-Fortran  ver¬ 
sion  approximately  half  of  the  lime  is  spent  in  the  solu¬ 
tion  part.  For  a  50  equation  circuit  505.  and  for  a  450 
cquotion  circuit  605.  of  the  total  lime  are  spent  solving 
the  linearized  equations. 

An  overall  (actor  of  two  improvement  has  been 
obtained  just  by  using  generated  loopless  machine  code 
[9,10]  for  the  equation  solution  (a  factor  of  B  for  this 
part  only).  Some  meaningful  statistical  data  on  the 
transient  analysis  of  the  circuit  in  Fig.  1  are  summar¬ 
ized  in  Table  1.  The  two  analyses  of  Table  1  used  Qve 
minutes  of  CRAY-1  epu  lime. 
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The  generated  code  only  uses  the  scalar  registers 
in  this  case.  In  the  new  program  there  will  be  identical 
code  templates  which  use  vector  registers.  The  compu- 
totion  speed  achieved  with  the  scalar  version  is  2.5 
Mflops.  It  is  estimated  the  vector  version  will  average 
in-20  Mflops  at  the  entire  matrix  level  (subcircuit  + 
interconnection). 

6.  CoDcluaion 

The  approach  of  carefully  matching  the  charac¬ 
teristics  of  the  circuits  to  be  analyzed  with  the 
appropriate  algorithms  and  machine  architecture  make 
it  possible  to  extend  circuit  simulation  to  LSI  and  VLSI 
circuits.  One  order-of-magnilude  increase  in  speed  per¬ 
formance  is  expected  from  the  new  program.  This 
brings  the  program  in  the  same  performance  category 
as  an  existing  timing  simulator  running  on  the  CR^Y-1. 
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